Set resource limit for addon containers #10653

Merged
4 commits merged into kubernetes:master on Jul 2, 2015

Conversation

dchen1107
Member

Measurements are based on the current default configuration for GCE / GKE (?) and AWS: 4 nodes, 30~50 pods per node.

For all containers, I allocate 100m of CPU even though they all use less than that, since it matches today's default setting from the LimitRange of the default namespace (see the sketch after this list). Memory is more complicated here:

  1. All containers in the DNS service are very stable and use a tiny bit of memory (< 5Mi), but I rounded them up to 50Mi anyway.
  2. It looks like fluentd has a memory leak over time, but since it is stateless, it is safe to be OOM-killed for now. Because it is a static pod and it is too hard to change static pods' configuration today, I decided to do nothing about it.
  3. We observed memory leaks in both influxdb and heapster. Without any new pods scheduled to the cluster, over one night the memory usage of influxdb shot up from 150Mi to 2.5Gi, and that of heapster from 150Mi to 2Gi. I chose 200Mi for both for now. Influxdb now has an emptyDir volume attached to the pod, so its data can be recovered.
  4. We also observed memory leaks in elasticsearch and kibana. Since elasticsearch and kibana are not started by default on GCE and GKE, I decided not to set memory limits for them for now.
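
For reference, a minimal sketch of the kind of LimitRange mentioned above, assuming an object in the default namespace that defaults container CPU to 100m; the exact object shipped with a cluster may differ:

```yaml
# Illustrative sketch only (assumed name and values): a LimitRange that
# defaults container CPU to 100m for pods created in the default namespace.
apiVersion: v1
kind: LimitRange
metadata:
  name: limits
  namespace: default
spec:
  limits:
  - type: Container
    default:
      cpu: 100m
```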

cc/ @bgrant0607 @thockin @saad-ali @a-robinson

@eparis and @justinsb too for the Vagrant case.

@bgrant0607
Member

How often do we expect heapster and influxdb to restart, then? And these numbers are for how many pods? Do we have scaling guidance for users?

If we want to ensure that all addons have limits set, we'll need a check, but that can be done later -- file an issue for that. That would also mean that elasticsearch and kibana would need to get memory limits.

Otherwise, LGTM

@bgrant0607
Member

#10077 was merged. The UI will need limits, presumably.

@bgrant0607 bgrant0607 added this to the v1.0 milestone Jul 1, 2015
@bgrant0607 bgrant0607 assigned thockin and unassigned bgrant0607 Jul 1, 2015
@k8s-bot

k8s-bot commented Jul 1, 2015

GCE e2e build/test passed for commit 6b61918.

@a-robinson
Contributor

If we're going to run fluentd on the master (#10597), it should definitely have resource limits. As you mention, it should be safe to be oom-killed.

@dchen1107
Member Author

@bgrant0607 The problem I have today is that neither heapster nor influxdb has any limit, so they will only be OOM-killed when the system eventually OOMs; no, I don't have that restart frequency number yet. I expect to get this PR in so that I can collect that number.

The numbers I picked here are based on 4 nodes and 100 containers per node (30 pods with 2 containers each). We can provide scaling guidance for users based on that.

@dchen1107
Member Author

Can we fix the UI in a separate PR? I really want to start collecting heapster and influxdb stats as early as possible here.

@a-robinson fluentd has a slight memory leak over time. I will send another PR to resolve the fluentd issue soon.

@a-robinson
Contributor

Ack, thanks. Its memory leak would seem to make it even more important to put a cap on its memory usage.

@dchen1107
Member Author

@a-robinson I pushed an extra commit to update the fluentd manifest file too. For the UI, we have to wait for measurements, hence the separate PR. :-)

@@ -9,6 +9,7 @@ spec:
     resources:
       limits:
         cpu: 100m
+        memory: 200Mi
Contributor

Isn't 100Mi more than enough given that in all of Saad's tests it never used that much?

Member Author

No, Saad's report in that issue covers only a couple of hours. The limit I chose here is based on over 2 days. Both of us are running soak tests against our default configurations.

Member

To be clear, my report was over the course of 24 hours. It looks like the fluentd container with the elasticsearch plugin has a leak, because memory usage continues to grow. After two days it hit 151 MB. See #10335 (comment)

Member

Why is the memory limit set for fluentd-gcp but not fluentd-es?

@a-robinson
Contributor

Sorry for all the comments, but you should also add limits to cluster/saltbase/salt/fluentd-es/fluentd-es.yaml while you're adding them to fluentd-gcp.
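
For illustration only: a minimal sketch of what such a resources stanza in fluentd-es.yaml could look like, assuming the same values as the fluentd-gcp change above; note that, as discussed below, this PR left fluentd-es without limits.

```yaml
# Hypothetical fragment only -- values are assumed, and this change was not
# made in this PR. Shows where resource limits would go in the fluentd-es
# container spec.
spec:
  containers:
  - name: fluentd-es
    resources:
      limits:
        cpu: 100m
        memory: 200Mi
```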

@thockin
Member

thockin commented Jul 2, 2015

LGTM

@thockin thockin added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 2, 2015
@k8s-bot

k8s-bot commented Jul 2, 2015

GCE e2e build/test passed for commit 54531d9.

@dchen1107
Member Author

@a-robinson As mentioned above, cluster/saltbase/salt/fluentd-es/fluentd-es.yaml is not used by GKE and GCE, so I would rather not choose limits for it. My concern is reducing schedulability for those small clusters.

@dchen1107
Member Author

Since I have two TLs' LGTMs now, I marked it ok-to-merge. The reason I am rushing this in is that we are missing important data for heapster and influxdb, for example restart frequency.
I will monitor Jenkins tonight; if it causes any trouble, feel free to revert it later. Thanks!

@dchen1107
Member Author

@zmerlynn or @nikhiljindal Can I get a merge for this one? I am going to monitor Jenkins tonight and tomorrow. Thanks!

@bgrant0607
Member

LGTM

@zmerlynn
Member

zmerlynn commented Jul 2, 2015

I'm watching tonight as well.

zmerlynn added a commit that referenced this pull request Jul 2, 2015
Set resource limit for addon containers
@zmerlynn zmerlynn merged commit 1d16be6 into kubernetes:master Jul 2, 2015
@dchen1107
Member Author

Thanks, everyone! You guys are super!

@dchen1107
Member Author

#10335

@dchen1107
Member Author

#10760 for the memory leaks in heapster and influxdb

Labels
lgtm "Looks good to me", indicates that a PR is ready to be merged.